My data comes from Kaggle. License is Public Domain.https://www.kaggle.com/mysarahmadbhat/120-years-of-olympic-history The dataset is historical data from the modern Olympic Games, from Athens in 1896 to Rio de Janeiro in 2016. The data includes the athlete’s name, gender, age, height, weight, country, and medal, as well as the project name, sport, game, year, and city, with a unique number for each athlete. ### Data loading, full setup, and overall database data
Load libraries:
library(ggplot2)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(gridExtra)
##
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
##
## combine
Load the original database:
Olympic1 <-read.csv("/Users/liuyu/Documents/R\ File/athlete_events.csv")
Olympic1
Analyze the data with str( ):
str(Olympic1)
## 'data.frame': 271116 obs. of 15 variables:
## $ ID : int 1 2 3 4 5 5 5 5 5 5 ...
## $ Name : chr "A Dijiang" "A Lamusi" "Gunnar Nielsen Aaby" "Edgar Lindenau Aabye" ...
## $ Sex : chr "M" "M" "M" "M" ...
## $ Age : int 24 23 24 34 21 21 25 25 27 27 ...
## $ Height: int 180 170 NA NA 185 185 185 185 185 185 ...
## $ Weight: num 80 60 NA NA 82 82 82 82 82 82 ...
## $ Team : chr "China" "China" "Denmark" "Denmark/Sweden" ...
## $ NOC : chr "CHN" "CHN" "DEN" "DEN" ...
## $ Games : chr "1992 Summer" "2012 Summer" "1920 Summer" "1900 Summer" ...
## $ Year : int 1992 2012 1920 1900 1988 1988 1992 1992 1994 1994 ...
## $ Season: chr "Summer" "Summer" "Summer" "Summer" ...
## $ City : chr "Barcelona" "London" "Antwerpen" "Paris" ...
## $ Sport : chr "Basketball" "Judo" "Football" "Tug-Of-War" ...
## $ Event : chr "Basketball Men's Basketball" "Judo Men's Extra-Lightweight" "Football Men's Football" "Tug-Of-War Men's Tug-Of-War" ...
## $ Medal : chr NA NA NA "Gold" ...
There are 27,116 obs and 15 variables consisting of 10 char, 4 int and 1 numeric. Player metrics include “ID,” “Name,” “Sex,” “Age,” “Height,” “Weight,” “Team,” “Sport,” “Event,” “NOC,” “Medal” and Olympic metrics include “City,” “Season,” “Games,” “Year”.
Analyze the data with summary( ):
summary(Olympic1)
## ID Name Sex Age
## Min. : 1 Length:271116 Length:271116 Min. :10.00
## 1st Qu.: 34643 Class :character Class :character 1st Qu.:21.00
## Median : 68205 Mode :character Mode :character Median :24.00
## Mean : 68249 Mean :25.56
## 3rd Qu.:102097 3rd Qu.:28.00
## Max. :135571 Max. :97.00
## NA's :9474
## Height Weight Team NOC
## Min. :127.0 Min. : 25.0 Length:271116 Length:271116
## 1st Qu.:168.0 1st Qu.: 60.0 Class :character Class :character
## Median :175.0 Median : 70.0 Mode :character Mode :character
## Mean :175.3 Mean : 70.7
## 3rd Qu.:183.0 3rd Qu.: 79.0
## Max. :226.0 Max. :214.0
## NA's :60171 NA's :62875
## Games Year Season City
## Length:271116 Min. :1896 Length:271116 Length:271116
## Class :character 1st Qu.:1960 Class :character Class :character
## Mode :character Median :1988 Mode :character Mode :character
## Mean :1978
## 3rd Qu.:2002
## Max. :2016
##
## Sport Event Medal
## Length:271116 Length:271116 Length:271116
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
##
The amount of data is very large, each player will correspond to different variables, and each Olympic Games will have different variables, choose different indicators will get different data, and the source data is extremely complex.
Analyze the data with head( ):
head(Olympic1)
Count the number of NA for each classification:
apply(is.na(Olympic1), 2, sum)
## ID Name Sex Age Height Weight Team NOC Games Year Season
## 0 0 0 9474 60171 62875 0 0 0 0 0
## City Sport Event Medal
## 0 0 0 231333
“Medal” has a lot of NA we understand, because only the most powerful top three have medals, most people do not have medals, and “Age”, “Height”, “Weight” may be due to earlier years of data loss or not written, the use of these variables later to pay attention to remove NA values or other.
To make a data frame with “Season,” “Year,” “Game,” “Number_of_Participants” to make it easier for us to explore the annual changes in the number of people for each summer and winter Olympics
Data cleaning and creation:
Olympic3<- Olympic1[!duplicated(Olympic1[,c("Season","Year")]),]
Olympic3 <-subset(Olympic3,select=c(Season,Year))
Olympic3 <- Olympic3 [order(Olympic3 [,"Year"],decreasing=F),]
Olympic4 <- Olympic1[!duplicated(Olympic1[,c("ID","Games")]),]
Olympic4 <-subset(Olympic4 ,select=c(Year,Games,Season))
ID_number <- group_by(Olympic4 , Games)
History_Number<- summarise(ID_number,Number_of_Participants = n())
rownames(Olympic3) <-NULL
rownames(Olympic3) <- rownames(Olympic3, do.NULL = FALSE)
Olympic3 <- Olympic3 [order(Olympic3 [,"Year"],decreasing=F),]
Olympic3 <- arrange(Olympic3,Olympic3[,"Year"],Olympic3[,"Season"])
History_Number<- cbind(Olympic3,History_Number)
rownames(History_Number) <-NULL
rownames(History_Number) <- rownames(History_Number, do.NULL = FALSE)
#The new data frame is as follows: #
History_Number #new data frame #
ggplot(History_Number,aes(x = Year, y= Number_of_Participants,color=Season))+geom_bar(stat="identity",fill="white")+geom_smooth(method = 'loess',formula='y ~ x',fill="white",alpha=.100)+ggtitle("Number of Olympic Games")
The new data frame details the number of people corresponding to each Olympic Games obtained after the source data is cleaned. First of all, the overall number of people in all Winter Olympics and Summer Olympics is on the rise. Secondly, the number of participants in the Summer Olympics is far greater than that of the Winter Olympics. Finally, the data in the middle paragraphs are empty due to the influence of the world war at that time.
We need to get all the Sports categories and all theevents specific to each sport. Data cleaning and creation:
#Data cleaning#
Topic_Sport<-subset(Olympic1,select=c(Sport,Event)) #Extract related categories#
Topic_Sport<- Topic_Sport[!duplicated(Topic_Sport[,c("Event")]),] #Removes heavy duplicate data#
rownames(Topic_Sport) <- NULL
All_Sport <- Topic_Sport [order(Topic_Sport [,"Sport"],decreasing=F),]
rownames(All_Sport) <- NULL
rownames(All_Sport) <- rownames(All_Sport, do.NULL = FALSE)
#Olympic Games all Events#
All_Sport
ID_number <- group_by(Topic_Sport , Sport)
Topic_Sport2 <- summarise(ID_number,Count = n())
Topic_Sport2 <- data.frame(Topic_Sport2)
#The number of Events per Sport#
Topic_Sport2
#Drawing#
bar <- ggplot(data = Topic_Sport) +
geom_bar(mapping = aes(x = Sport, y=..count..,fill = Sport),show.legend = FALSE,width = 1) +theme(aspect.ratio = 1) +labs(y = NULL)
p1 <- bar + coord_flip()+ theme(axis.text.x = element_text(angle = 90, hjust = 1, size = 4),axis.text.y = element_text(angle = 0, size = 5.2))
p2 <- bar + coord_polar()+ theme(axis.text.x = element_text(angle = 0, hjust = 1, size = 2.8),axis.text.y = element_text(angle = 0, size = 0))
p1+theme(aspect.ratio = 1) #All Events#
#Select the top 10#
Topic_Sport3 <- Topic_Sport2 [order(Topic_Sport2 [,"Count"],decreasing=T),]
Top10_Sport <- Topic_Sport3 [1:10,]
rownames(Top10_Sport) <-NULL
rownames(Top10_Sport) <- rownames(Top10_Sport, do.NULL = FALSE)
Top10_Sport
#Drawing#
Top10_Sport <- Top10_Sport [order(Top10_Sport[,"Count"],decreasing=F),]
bar2 <- ggplot(Top10_Sport,aes(x = Sport, y= Count,fill=Sport))+geom_bar(stat="identity")+scale_x_discrete(limits=factor(Top10_Sport[,1]))
p3 <- bar2 +coord_flip()+theme(axis.text.x = element_text(angle = 90, hjust = 1, size = 10),
axis.text.y = element_text(angle = 0, size = 10))
p4 <- bar2 + coord_polar()+theme(axis.text.x = element_text(angle = 0, hjust = 1, size = 0),
axis.text.y = element_text(angle = 0, size = 10))
p3 #Top10_Sport #
p4 #Top10_Sport #
After data cleaning of the source data, we have obtained a large amount of clear new data. The first data frame shows all the Olympic sports; the second data frame shows in detail how many events are included in all Olympic sports; the third data frame shows the top 10 Olympic sports. First of all, the first picture vividly shows the number of events corresponding to each Olympic sport. We can see that items such as “Shooting” and “Athletics” are among the best.Secondly, in the second picture, we get the number of items corresponding to the top 10 sports by filtering and sorting the top 10 sports. Finally, the last picture shows their approximate proportions and numbers. #### (3).Summer Olympic and Winter Olympic Explore the total number of people who have participated in the Summer and Winter Olympics since the 120s Data cleaning and creation:
#Data cleaning#
Topic_Season<-subset(Olympic1,select=c(ID,Season)) #Select useful columns#
Topic_Season<- Topic_Season[!duplicated(Topic_Season[,c("ID","Season")]),] #Removes heavy duplicate data#
rownames(Topic_Season) <- NULL
rownames(Topic_Season) <- rownames(Topic_Season, do.NULL = FALSE) #format#
ID_number <- group_by(Topic_Season, Season)
Summer_Winter <- summarise(ID_number,Count = n())
Summer_Winter #Summer and Winter#
#Drawing#
bar <- ggplot(data = Topic_Season) +geom_bar(mapping = aes(x = Season, fill = Season),show.legend = FALSE,width = 1)+theme(aspect.ratio = 1)+labs(x = NULL, y = NULL)+ggtitle("Summer Olympic and Winter Olympic")
p1 <- bar
p2 <- bar + coord_polar()+theme(axis.text.x = element_text(angle = 0, hjust = 1, size = 6),
axis.text.y = element_text(angle = 0, size = 10),
axis.title.x = element_blank(),
axis.title.y = element_blank())
grid.arrange(p1, p2, ncol=2) #Summer and Winter#
ID_number <- group_by(Topic_Season, Season)
qwe <- summarise(ID_number,Count= n())
Label = as.vector(qwe$Season) #Turn into a vector#
Label1 = paste(Label, "(", round(qwe$Count/sum(qwe$Count)*100, 3), "%)", sep = "") #Keep three decimal places on the result#
p <- ggplot(qwe, aes(x = '', y = Count, fill = Season))+
geom_bar(stat = 'identity', width = 1) + #The noise in the center disappears when the width > 1#
coord_polar(theta = 'y')+
labs(x = "", y = "", title = "") +
theme(axis.ticks = element_blank()) + # Remove the scale of the top left border#
scale_fill_discrete(breaks = qwe$Season, labels = Label1)+
theme(legend.title = element_blank())+
theme(axis.text.x = element_blank())+ggtitle("Summer Olympic and Winter Olympic")
p #Proportion#
By cleaning the data, we have obtained a new table that clearly shows the number of participants in the Summer Olympics and Winter Olympics over the past 120 years. First of all, the first picture shows that the number of participants in the Olympics is far more than the Winter Olympics through the comparison of histograms. The last picture shows the proportion of participants in the Summer and Winter Olympics, which accounted for 86% of the total number of participants in the 120 years. ### [2]Topic of Olympic athletes
Explore the differences in height and weight between Olympic athletes of different genders and the number of men and women participating Data cleaning and creation:
#Data Cleaning#
#Select useful information in the source data#
Topic_Sex <-subset(Olympic1,select=c(ID,Height,Weight,Sex))
Topic_Sex<- Topic_Sex[!duplicated(Topic_Sex[,c("ID")]),] #Remove repeat contestants#
rownames(Topic_Sex) <- NULL #Initialize the data#
rownames(Topic_Sex) <- rownames(Topic_Sex, do.NULL = FALSE)
Topic_Sex #Each player's height, weight, gender#
#Drawing#
Topic_Sex<- na.omit(Topic_Sex)
bar <- ggplot(data = Topic_Sex) + geom_bar(mapping = aes(x = Sex, fill = Sex),
show.legend = FALSE,width = 1) + theme(aspect.ratio = 1) + labs(x = NULL, y = NULL)+ggtitle("Gender differences")
p1 <- bar + coord_flip() #Swap coordinate systems#
p2 <- bar + coord_polar()
grid.arrange(p1, p2, ncol=2) #Total number#
ID_number <- group_by(Topic_Sex,Sex)
Data <- summarise(ID_number,Count= n())
Label = as.vector(Data$Sex) #Turn into a vector#
Label1 = paste(Label, "(", round(Data$Count/sum(Data$Count)*100, 3), "%)", sep = "") #Keep three decimal places on the result#
p3 <- ggplot(Data, aes(x = '', y = Count, fill = Sex))+
geom_bar(stat = 'identity', width = 1) + #The noise in the center disappears when the width > 1#
coord_polar(theta = 'y')+
labs(x = "", y = "", title = "") +
theme(axis.ticks = element_blank()) + # Remove the scale of the top left border#
scale_fill_discrete(breaks = Data$Sex, labels = Label1)+
theme(legend.title = element_blank())+
theme(axis.text.x = element_blank())+ggtitle("Gender differences")
p3 #The ratio of men to women#
ggplot(Topic_Sex,aes(x=Height,y=Weight,color=Sex),)+geom_point()+ggtitle("Gender differences") #Distribution chart#
ggplot(Topic_Sex,aes(x=Height,y=Weight,color=Sex,fill="white"))+geom_violin(fill="steelblue",alpha=.6)+ggtitle("Gender differences")
## Warning: position_dodge requires non-overlapping x intervals
ggplot(Topic_Sex,aes(x=Height,y=Weight,color=Sex,fill="white"))+geom_boxplot(fill="white")+ggtitle("Gender differences")#Distribution chart#
We obtained a new data frame by reducing and repeating the source data, which showed in detail all the weight, height and gender information corresponding to each player ID. First of all, the first and second pictures show the number and ratio of male and female athletes participating in the Olympic Games for 120 years. We can see that there are far more male athletes than women. Secondly, through the third scatter chart, we found that the overall height and weight of male athletes and female athletes are quite different, which is caused by the human body structure. Finally, through the final violin chart and box chart, we removed many atypical data. We can find that the overall weight of female athletes is around 60kg, while the overall weight of male athletes is around 75kg; the overall height of female athletes is around 170cm, and the overall height of male athletes is around 180cm. about. #### (2)Multi-sport athletes Are there any athletes who participate in many sports alone? Data cleaning and creation:
#Data Cleaning#
Topic_SportNumber<-subset(Olympic1,select=c(ID,Sport))
Topic_SportNumber<- Topic_SportNumber[!duplicated(Topic_SportNumber[,c("ID","Sport")]),]
row.names(Topic_SportNumber) <- NULL
rownames(Topic_SportNumber ) <- rownames(Topic_SportNumber , do.NULL = FALSE)
ID_number <- group_by(Topic_SportNumber, ID)
Types1 <- summarise(ID_number,Types = n())
ID_number <- group_by(Types1 , Types)
Types1 <- summarise(ID_number,Count = n())
Types1 <- data.frame(Types1)
#Accurate data for Multi-sport athletes #
Types1
#Drawing#
ggplot(Types1,aes(Types,Count))+geom_bar(stat="identity",fill = "lightblue", colour = "black")+ggtitle("Multi-sport athletes")+geom_text(aes(label = Count), vjust = -0.3, colour = "red", position = position_dodge(.7), size = 4)
By cleaning the data, we have obtained clear data. As shown in the first data box, there are indeed athletes participating in different types of Olympic sports. Through the display of the histogram, we found that very few remote mobilizations participated in different types of sports in the Olympic Games, and only 6 people participated in 4 events, which is very rare. It can be seen that athletes who are proficient in many different types of Olympic Games are very rare, and most people only participate in a single category of Olympic sports in their lifetime.
Explore the number of times each athlete has participated in the Olympics and the age of retirement over the past 120 years Data cleaning and creation:
#Data Cleaning#
#Filter the useless variables in this exploration#
Olympic2 <- Olympic1[!duplicated(Olympic1[,c("ID","Games")]),]
Olympic2 <- Olympic2[, !colnames(Olympic2) %in% c("Sport","Event","Medal")]
rownames(Olympic2) <-NULL
rownames(Olympic2) <- rownames(Olympic2, do.NULL = FALSE) #The initial line name#
ID_Age<-subset(Olympic2,select=c(ID,Age,Sex))
ID_number <- group_by(ID_Age, ID)
ID_Age <- summarise(ID_number,Number_of_Participants = n(),Retirement_Age = max(Age))
ID_Age <- data.frame(ID_Age)
row.names(ID_Age) <- NULL
rownames(ID_Age ) <- rownames(ID_Age , do.NULL = FALSE)
Sex <- (subset(Olympic2[!duplicated(Olympic2[,c("ID")]),],select=c(Sex)))
rownames(Sex) <- NULL
rownames(Sex) <- rownames(Sex, do.NULL = FALSE)
ID_Age_Sex <- cbind(ID_Age,Sex)
ID_Age_Sex <- na.omit(ID_Age_Sex)
ID_Age_Sex #Parameters: Player ID, number of participants, gender#
#Drawing#
#The relationship between cumulative participation and retirement age#
ggplot(ID_Age_Sex,aes(x=Number_of_Participants,y=Retirement_Age,color=Sex))+geom_boxplot()+ggtitle("Retirement_Age-Participations")
#Retirement age statistics#
ggplot(ID_Age_Sex, aes(x =Retirement_Age,color=Sex)) +geom_histogram(bins = 500,fill="white")+ggtitle("Retirement_Age-Participations")
After cleaning the huge source data, we have obtained a new data frame, which describes in detail the number of times each athlete participated in the Olympic Games and their retirement age. In the first picture, we use box plots to show their median, upper quartile, lower quartile and two marginal values. After excluding many marginal data, we found that most women participated in about 3 times, and men participated in about 7 times, while the retirement age of female athletes and female athletes was mostly around 25 years old. Secondly, the second bar graph shows the overall selection of retirement age for male and female athletes and the overall distribution of the number of participants. ### [3] Topic of country and medal ####(1)The country’s Olympic medals Explore the number of Olympic medals won by countries around the world over the past 120 years Data cleaning and creation:
#Data Cleaning#
Topic_Country1 <- subset(Olympic1,select=c(NOC,Medal))
Topic_Country1 <-na.omit(Topic_Country1 )
rownames(Topic_Country1) <- NULL
rownames(Topic_Country1) <- rownames(Topic_Country1, do.NULL = FALSE)
ID_number <- group_by(Topic_Country1, NOC)
Topic_Country <- summarise(ID_number,All_Medals= n())
Topic_Country <- data.frame(Topic_Country)
rownames(Topic_Country) <- NULL
rownames(Topic_Country) <- rownames(Topic_Country, do.NULL = FALSE)
#ranked medals for 120 years#
Topic_Country
Topic_Country <- Topic_Country [order(Topic_Country [,"All_Medals"],decreasing=T),]
Top10_Country <- Topic_Country [1:10,]
rownames(Top10_Country ) <- NULL #Initialize the data and rearrange the row names in order
rownames(Top10_Country ) <- rownames(Top10_Country , do.NULL = FALSE)
#Select the top 10#
Top10_Country
#Drawing#
Top10_Country <- Top10_Country [order(Top10_Country[,"All_Medals"],decreasing=F),]
bar <- ggplot(Top10_Country ,aes(x = NOC, y= All_Medals,fill=NOC))+geom_bar(stat="identity")+scale_x_discrete(limits=factor(Top10_Country [,1]))
p1 <- bar +coord_flip()+theme(axis.text.x = element_text(angle = 90, hjust = 1, size = 10),
axis.text.y = element_text(angle = 0, size = 10),
axis.title.x = element_blank(),
axis.title.y = element_blank())+ggtitle("Top10-Country")
p2 <- bar + coord_polar()+theme(axis.text.x = element_text(angle = 0, hjust = 1, size = 0),
axis.text.y = element_text(angle = 0, size = 10),
axis.title.x = element_blank(),
axis.title.y = element_blank())+ggtitle("Top10-Country")
p1 #Top10-County#
p2 #Top10-County#
After data cleaning, in the first data frame, we have obtained the number of medals each country has won in the 120-year history of the Olympic Games. The second data frame shows the top 10 countries that have won Olympic medals. According to the ranking of the histogram, we found that the number of gold medals in the United States is extremely high, and the rest of the countries are almost all developed countries. This requires us to explore the strength of the Olympic Games in developed and developing countries. ####(2)Developed and developing countries Explore the differences in the number of medals that developed and developing countries have won at the Olympics over the past 120 years Exploring developed and developing countries, it is well known that there are currently USA, CAN, AUS, NZL, GBR, IRL, FRA, NED, BEL, LUX, GER AUT, SUI, NOR, ISL, DEN, SWE, FIN, ITA, ESP, POR, GRE, SLO, CZE, TCH, MLT, CYP, JPN, KOR, SGP, SR Next, we count the percentage of medals won by developed countries as a percentage of the total number of medals. Data cleaning and creation:
#Data Creating and Data Cleaning#
Developed_Country <- filter(Topic_Country1, NOC == "USA" | NOC == "CAN" | NOC == "AUS"|NOC == "NZL" | NOC == "GBR"|NOC == "IRL" | NOC == "FRA"|NOC == "NED" | NOC == "BEL"|NOC == "LUX" | NOC == "GER"|NOC == "AUT" | NOC == "SUI"|NOC == "NOR" | NOC == "ISL"|NOC == "DEN"| NOC == "SWE"|NOC == "FIN" | NOC == "ITA"|NOC == "ESP" | NOC == "POR"|NOC == "GRE"| NOC == "SLO"|NOC == "CZE"| NOC == "TCH"|NOC == "MLT" |NOC == "CYP"| NOC == "JPN"|NOC == "KOR"| NOC == "SGP"|NOC == "SR" )
Developing_Country <- filter(Topic_Country1, NOC != "USA" & NOC != "CAN" & NOC != "AUS"& NOC != "NZL" & NOC != "GBR"& NOC != "IRL" & NOC != "FRA"& NOC != "NED" & NOC != "BEL"& NOC != "LUX" & NOC != "GER"& NOC != "AUT" & NOC != "SUI" & NOC != "NOR" & NOC != "ISL"& NOC != "DEN"& NOC != "SWE"& NOC != "FIN" & NOC != "ITA"& NOC != "ESP" & NOC != "POR" & NOC != "GRE"& NOC != "SLO"& NOC != "CZE"& NOC != "TCH"& NOC != "MLT" & NOC != "CYP"& NOC != "JPN"&NOC != "KOR"& NOC != "SGP" &NOC!= "SR")
Developing_Country$Country_type <- c("Developing")
Developed_Country$Country_type <- c("Developed")
All <- rbind(Developed_Country,Developing_Country)
Developed <- filter(Topic_Country, NOC == "USA" | NOC == "CAN" | NOC == "AUS"|NOC == "NZL" | NOC == "GBR"|NOC == "IRL" | NOC == "FRA"|NOC == "NED" | NOC == "BEL"|NOC == "LUX" | NOC == "GER"|NOC == "AUT" | NOC == "SUI"|NOC == "NOR" | NOC == "ISL"|NOC == "DEN"| NOC == "SWE"|NOC == "FIN" | NOC == "ITA"|NOC == "ESP" | NOC == "POR"|NOC == "GRE"| NOC == "SLO"|NOC == "CZE"| NOC == "TCH"|NOC == "MLT" |NOC == "CYP"| NOC == "JPN"|NOC == "KOR"| NOC == "SGP"|NOC == "SR" )
Developing <- filter(Topic_Country, NOC != "USA" & NOC != "CAN" & NOC != "AUS"& NOC != "NZL" & NOC != "GBR"& NOC != "IRL" & NOC != "FRA"& NOC != "NED" & NOC != "BEL"& NOC != "LUX" & NOC != "GER"& NOC != "AUT" & NOC != "SUI" & NOC != "NOR" & NOC != "ISL"& NOC != "DEN"& NOC != "SWE"& NOC != "FIN" & NOC != "ITA"& NOC != "ESP" & NOC != "POR" & NOC != "GRE"& NOC != "SLO"& NOC != "CZE"& NOC != "TCH"& NOC != "MLT" & NOC != "CYP"& NOC != "JPN"&NOC != "KOR"& NOC != "SGP" &NOC!= "SR")
Developing$Country_type <- c("Developing") #Developing countries that have won medals for 120 years#
rownames(Developed) <- NULL #Initialize the data and re-arrange the row names in order#
rownames(Developed) <- rownames(Developed, do.NULL = FALSE)
Developed$Country_type <- c("Developed") #Developed countries that have won medals in 120 years#
All_Country <- rbind(Developed,Developing)
#All countries that have won medals in 120 years#
All_Country<- All_Country[order(All_Country[,"All_Medals"],decreasing=T),]
rownames(All_Country) <- NULL #Initialize the data and re-arrange the row names in order#
rownames(All_Country ) <- rownames(All_Country , do.NULL = FALSE)
All_Country
#Drawing#
bar <- ggplot(data = All) +geom_bar(mapping = aes(x = Country_type, fill = Country_type,),show.legend = FALSE,width = 1)+theme(aspect.ratio = 1)
p1 <- bar
Label = as.vector(All_Country $Country_type) #Turn into a vector#
Label1 = paste(Label, "(", round(All_Country $All_Medals/sum(All_Country$All_Medals)*490, ), "%)", sep = "") #Keep three decimal places on the result#
p2 <- ggplot(All_Country , aes(x = '', y = All_Medals, fill = Country_type))+ #Create an axes#
geom_bar(stat = 'identity', width = 1) + #The noise in the center disappears when the width > 1#
coord_polar(theta = 'y')+ #Fold the bar chart into a pie chart (polar coordinates)#
labs(x = "", y = "", title = "") + #Set the label for the horizontal coordinates to empty#
theme(axis.ticks = element_blank()) + # Remove the scale of the top left border#
scale_fill_discrete(breaks = All_Country $Country_type, labels = Label1)+
theme(legend.title = element_blank())+
theme(axis.text.x = element_blank())
grid.arrange(p1, p2, ncol=2) #Proportion#
ggplot(All_Country,aes(x=NOC,y=All_Medals,color=Country_type))+geom_point()+theme(axis.text.x = element_text(angle = 0, hjust = 1, size = 0),
axis.text.y = element_text(angle = 0, size = 10),) #Distribution chart#
After difficult and complex data cleaning, we still obtained the data frame we needed, including the rankings of 149 countries that have won medals in 120 years. First of all, in the first picture, we find that 31 developed countries have taken away 69% of the total number of medals. Secondly, the dotted distribution map of national medals in the second picture also shows that most of the developed countries have relatively high rankings in the number of medals. All these show the strong sports strength of developed countries in the Olympic Games.
This project includes 3 major themes and 8 specific categories, using 7 different types of statistical charts, including bar chart, curve chart, sector chart, polar chart, scatter chart, violin chart and box chart, a large number of data cleaning, creation, and ideas of exploring big data. The above materials show the great information brought to us by the 120-year history of the Olympic Games.